In this project, we will do exploratory data analysis using the Red Wine Quality data set.
This data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). This variable dictionary explains the variables in the data set and how the data was collected.
First, we need a basic understanding of the dataset. The dimensions of the dataset are:
## [1] 1599 13
The variables in this data set are:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
The structure of this data set is as follows:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
As we can see from the structure of the data set, the X variable is the row index of the data set. The red wines take the following distinct quality values:
## [1] 5 6 7 4 8 3
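The outputs above are the kind produced by R's standard data frame inspection functions. As a minimal sketch, the snippet below rebuilds the first ten rows of a few columns from the structure output shown above and runs the same kind of inspection. The object name wine is an assumption, and the full analysis would load all 1,599 rows from the data file rather than typing values in by hand.

```r
# First ten rows of a few columns, copied from the str() output above.
# `wine` is an assumed name; the real data frame would be read from the
# data file and would have 1,599 rows and 13 columns.
wine <- data.frame(
  X                = 1:10,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine)               # number of rows and columns
names(wine)             # variable names
str(wine)               # type and leading values of each variable
unique(wine$quality)    # distinct quality scores, in order of appearance
summary(wine)           # min, quartiles, mean and max per variable
```

Because only ten rows are used here, the printed values will differ from the full-data outputs in this report.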
The basic statistics of the data set are as follows:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Generally, the fixed acidity (median 7.90) is much larger than the volatile acidity (median 0.52). The density of red wine lies in a narrow range, with a minimum of 0.9901, a maximum of 1.0037, and a mean of 0.9967. The alcohol content ranges from 8.40 to 14.90, while the quality ranges from 3 to 8.
I plot the distribution of red wine quality to see what it looks like.
Most wines fall into quality 5 and 6, and the distribution looks roughly normal.
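A quick numeric companion to this plot is a count of wines at each quality level. The sketch below uses only the ten quality values visible in the structure output (the vector name quality is illustrative), so the counts are not those of the full data set.

```r
# Quality scores of the first ten wines, from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

counts <- table(quality)       # number of wines per quality level
shares <- prop.table(counts)   # the same counts as proportions

print(counts)
print(round(shares, 2))

# Draw the bar chart only when running in an interactive session.
if (interactive()) {
  barplot(counts, xlab = "quality", ylab = "number of wines")
}
```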
I also plot the distribution of each variable to see whether it also follows a normal distribution.
The distribution of fixed.acidity seems normal.
The distribution of volatile.acidity also seems normal, with a slight right tail.
citric.acid is skewed, with most values on the left side.
residual.sugar is also highly skewed, with a long right tail.
chlorides has a distribution similar to that of citric.acid.
free.sulfur.dioxide is also skewed.
total.sulfur.dioxide is also skewed.
density is normally distributed.
pH is normally distributed.
sulphates has a long right tail.
alcohol has a long right tail.
While most variables follow a normal distribution, some have long tails, such as citric.acid, total.sulfur.dioxide and residual.sugar.
For the variables with long tails, I transform them to log scale and see what their log values look like.
log(citric.acid) doesn’t give us a normal distribution.
log(residual.sugar) still has a long right tail.
log(chlorides) seems like a normal distribution.
log(free.sulfur.dioxide) seems normal.
log(total.sulfur.dioxide) seems normal.
log(alcohol) shows no big difference.
On the log scale, total.sulfur.dioxide shows a normal distribution. The others do not show significant changes in their distributions.
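One way to put a number on "long right tail" is a skewness statistic computed before and after the log transform. The sketch below is an illustration only: the helper skewness() is a hand-written moment-based estimate rather than anything used in the original analysis, base-10 logs are an assumption, and the input is the ten total.sulfur.dioxide values visible in the structure output.

```r
# total.sulfur.dioxide of the first ten wines, from the str() output above.
tsd <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Moment-based skewness: third central moment divided by the second
# central moment raised to the power 1.5. Zero means symmetric,
# positive means a longer right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

log_tsd <- log10(tsd)   # assumed base 10; the base only rescales the axis

skew_raw <- skewness(tsd)
skew_log <- skewness(log_tsd)
cat("skewness, raw scale:", round(skew_raw, 2), "\n")
cat("skewness, log scale:", round(skew_log, 2), "\n")
```

A skewness closer to zero after the transform is consistent with the histogram looking more symmetric on the log scale.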
total.acidity seems normal.
There are 1,599 wines in the dataset with 13 features (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality). The variable X is the index of the dataset. The quality is a factor variable with levels 3, 4, 5, 6, 7, and 8.
Other observations:
The main features in the dataset are density, pH, and total.sulfur.dioxide. The quality of red wine may be predicted with the main features and some combination of the other variables.
fixed.acidity, volatile.acidity, sulphates, alcohol, residual.sugar, chlorides and total.sulfur.dioxide may also contribute to the quality of red wine.
I created a new variable total.acidity by adding the fixed.acidity and volatile.acidity together, because I think the total.acidity may better represent the quality of red wine.
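The derived variable is just an element-wise sum of two columns. A minimal sketch, using the ten rows visible in the structure output and an assumed data frame name wine:

```r
# First ten rows of the two acidity columns, from the str() output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# New column: element-wise sum of fixed and volatile acidity.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

print(head(wine))
```

Since fixed.acidity is roughly ten times larger than volatile.acidity, the sum is dominated by fixed.acidity.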
Some of the features have long tails. Therefore, I transformed them (e.g. total.sulfur.dioxide) to log scale to see whether the log-scaled variables follow a normal distribution more closely. The reasoning is that, because the quality variable generally follows a normal distribution, normally distributed variables may be better predictors of the quality of red wine.
I would like to see how the variables correlate with each other, so I plot the following correlation figure.
It looks like some variables are well correlated with each other. For further analysis, I plot some pairs in individual figures, such as pH vs fixed.acidity and alcohol vs quality. Some of them do show good correlations.
pH seems to be negatively correlated with fixed.acidity.
High quality red wine tends to have high alcohol.
High alcohol red wine seems to have low density.
density is highly correlated with fixed.acidity.
From the above plots, we can see the fixed.acidity and pH seem to be negatively correlated, while alcohol and density also seem negatively correlated. The density and fixed.acidity have a pretty good positive correlation.
The fixed.acidity has a negative correlation with the pH, which makes sense because more acid means a lower pH. The fixed.acidity also has a positive correlation with the density, and the density is negatively correlated with the pH. The quality has its highest correlation with the alcohol.
It looks like the total.acidity has a very high correlation with the fixed.acidity. Therefore, we may not need the total.acidity variable.
The free.sulfur.dioxide and total.sulfur.dioxide correlate well, with a correlation coefficient of about 0.67.
The fixed.acidity and citric.acid also have a positive correlation of 0.67.
The density is negatively correlated with the alcohol, with a coefficient of -0.50.
Leaving aside total.acidity, the variable I added, the strongest relation is between fixed.acidity and pH, with a correlation coefficient of -0.68.
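Coefficients like these can be read off a correlation matrix. The sketch below assumes Pearson correlation (the default of cor(); the report does not state which method was used) and works on the ten rows visible in the structure output, so its numbers will not match the full-data coefficients quoted above.

```r
# First ten rows of four columns, from the str() output above.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pairwise Pearson correlations between all columns.
r <- cor(wine)
print(round(r, 2))

# A single pair can be picked out by row and column name.
r_acid_ph <- r["fixed.acidity", "pH"]
cat("fixed.acidity vs pH:", round(r_acid_ph, 2), "\n")
```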
I now focus on variables that are correlated with each other and related to the quality of red wine, such as alcohol, density, and pH.
I plot them together to see if there’s any interaction between them, and whether this interaction affects the quality of red wine.
High quality red wine typically has high alcohol and low density.
No obvious relation.
High quality red wine tends to have high alcohol and low pH.
No obvious relation.
Red wine with low density and low pH tends to have high quality.
Good correlation between pH and fixed.acidity.
High quality red wine tends to have high alcohol and low density.
High quality red wine tends to have high alcohol, low density, and low pH.
High quality red wine tends to have low total.sulfur.dioxide, low density, and high alcohol.
I find that high quality red wines typically have high alcohol combined with low density, high fixed.acidity, low pH, or low total.sulfur.dioxide.
High quality red wine tends to have high alcohol/density ratio.
High quality red wine tends to have high alcohol/pH ratio.
No obvious relation.
No obvious relation.
No obvious relation.
No obvious relation.
No obvious relation.
No obvious relation.
The quality of red wine can be better distinguished by the alcohol/pH ratio and the alcohol/density ratio, as shown by the distributions and mean values in the boxplots.
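The per-level means behind such boxplots can be computed by forming the ratio and averaging it within each quality level. The sketch below does this for alcohol/pH on the ten rows visible in the structure output; the column name alcohol.pH and the data frame name wine are illustrative, and only quality levels 5, 6 and 7 occur in this small sample.

```r
# First ten rows of three columns, from the str() output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Ratio of alcohol to pH for every wine.
wine$alcohol.pH <- wine$alcohol / wine$pH

# Mean ratio within each quality level.
ratio_means <- tapply(wine$alcohol.pH, wine$quality, mean)
print(round(ratio_means, 2))

# Draw the boxplot only when running in an interactive session.
if (interactive()) {
  boxplot(alcohol.pH ~ quality, data = wine,
          xlab = "quality", ylab = "alcohol / pH")
}
```

The same pattern applies to alcohol/density once the density column is available in full.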
High quality red wines typically have high alcohol combined with low density, high fixed.acidity, low pH, or low total.sulfur.dioxide.
In short, high quality red wine generally corresponds to low density, low pH, low total.sulfur.dioxide, and high alcohol.
fixed.acidity and pH tend to move in opposite directions, which makes sense, while fixed.acidity and density tend to increase together.
High quality red wine tends to have high alcohol/pH and alcohol/density ratios, which is very interesting.
The distribution of red wine quality appears to be normal, with most wines at quality levels 5 and 6.
The quality of red wine is related to the alcohol, density, pH and total.sulfur.dioxide. For example, the direct correlation between quality and alcohol is 0.48, while the correlation coefficient between alcohol and density is -0.50, between density and pH is -0.34, and between alcohol and total.sulfur.dioxide is -0.21.
Generally, high quality red wine corresponds to low density, low pH, low total.sulfur.dioxide, and high alcohol.
The quality of red wine is more related to the alcohol/pH ratio. From the plot, we can see that the means of alcohol/pH for quality levels 6-8 (about 3.2, 3.5, and 4.2) are generally greater than those of levels 3-5 (about 2.9, 3.0, and 2.9). Generally, high quality red wine has a high alcohol/pH ratio.
The red wine data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
The question I’m trying to answer is what factors affect the quality of red wine. I started by understanding the basic structure and variables of the dataset, and then I explored the distributions of the different variables. I found that the quality of red wine in the dataset generally follows a normal distribution, and many other variables, such as pH, have a similar distribution. For some variables that are not normally distributed, I transformed them to log scale to better understand their distribution. Some variables in the dataset are correlated, such as fixed.acidity and pH, which makes sense. After many explorations with different variables, I found that the most important variables for red wine quality are alcohol, density, and pH. Through boxplots of the alcohol/pH and alcohol/density ratios for each quality level, we can easily see the relations between these variables. That is, high quality red wine generally has high alcohol/pH and alcohol/density ratios.
For future work, one can use the factors that are important for the red wine quality to build a model and predict the quality of red wine.
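As one possible starting point for such a model, the sketch below fits an ordinary linear regression of quality on alcohol and pH. This is an assumption for illustration, not a method used in this report: linear regression treats the quality score as a continuous number, and the fit here uses only the ten rows visible in the structure output, so the coefficients mean nothing beyond showing the mechanics.

```r
# First ten rows of three columns, from the str() output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Ordinary least squares with quality as the response.
fit <- lm(quality ~ alcohol + pH, data = wine)
print(coef(fit))

# Fitted quality for the same wines.
predicted <- predict(fit)
print(round(predicted, 2))
```

With the full data set, the other variables identified above (density, total.sulfur.dioxide) could be added to the formula in the same way.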